
    complexFuzzy: A novel clustering method for selecting training instances of cross-project defect prediction

    Over the last decade, researchers have investigated to what extent cross-project defect prediction (CPDP) offers advantages over traditional defect prediction settings. In CPDP, training and testing data are not taken from the same project; instead, dissimilar projects are employed. Selecting proper training data plays an important role in the success of CPDP. In this study, a novel clustering method named complexFuzzy is presented for selecting the training data of CPDP. The method determines membership values with the help of metrics that can be considered indicators of complexity. First, CPDP combinations are created on 29 different data sets. Subsequently, complexFuzzy is evaluated by considering the cluster centers of the data sets and comparing performance measures including area under the curve (AUC) and F-measure. The method is superior to five comparison algorithms in terms of the distance of cluster centers and prediction performance.
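
    The abstract does not include the algorithm itself; the sketch below only illustrates the general idea of biasing fuzzy cluster memberships with a complexity indicator and then selecting confidently clustered instances. The weighting rule, the selection threshold, and all variable names are assumptions for illustration, not the published complexFuzzy method.

    import numpy as np

    def complexity_weighted_fcm(X, complexity, n_clusters=2, m=2.0, n_iter=100, seed=0):
        """Fuzzy c-means whose memberships are biased by a complexity score (illustrative only)."""
        rng = np.random.default_rng(seed)
        U = rng.random((X.shape[0], n_clusters))
        U /= U.sum(axis=1, keepdims=True)                 # memberships sum to 1 per instance
        w = complexity / complexity.sum()                 # normalized complexity weights (assumed scheme)
        for _ in range(n_iter):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U = 1.0 / dist ** (2.0 / (m - 1.0))           # standard fuzzy c-means membership update
            U *= 1.0 + w[:, None]                         # bias toward complex instances (assumption)
            U /= U.sum(axis=1, keepdims=True)
        return U, centers

    # toy usage: keep cross-project rows that have a confident cluster assignment
    X = np.random.rand(50, 4)                             # static code metrics (placeholder)
    cx = np.random.rand(50)                               # complexity indicator (placeholder)
    U, centers = complexity_weighted_fcm(X, cx)
    selected = np.where(U.max(axis=1) > 0.7)[0]           # 0.7 threshold is arbitrary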

    The impact of parameter optimization of ensemble learning on defect prediction

    Machine learning algorithms have configurable parameters that practitioners generally use with default settings. Modifying the parameters of a machine learning algorithm, known as hyperparameter optimization (HO), is performed to find the most suitable parameter setting in classification experiments. Existing studies propose using either the default classification model or an optimal parameter configuration. This work investigates the effects of applying HO to ensemble learning algorithms in terms of defect prediction performance. Further, this paper presents a new ensemble learning algorithm called novelEnsemble for defect prediction data sets. The method has been tested on 27 data sets and compared with three alternatives. Welch's heteroscedastic F test is used to examine the difference between performance parameters, and Cliff's delta is applied to the results of the comparison algorithms to control the magnitude of the difference. According to the results of the experiment: 1) ensemble methods featuring HO perform better than a single predictor; 2) although the error of triTraining decreases linearly, it remains at an unacceptable level; 3) novelEnsemble yields promising results, especially in terms of area under the curve (AUC) and Matthews correlation coefficient (MCC); 4) the effect of HO is not static and depends on the scale of the data set; 5) not every ensemble learning approach creates a favorable effect when combined with HO. To demonstrate the importance of the hyperparameter selection process, the experiment is validated with suitable statistical analyses. The study revealed that the success of HO, contrary to expectations, depends not on the type of classifier but on the design of the ensemble learners.
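
    As a companion to the abstract, the snippet below shows generic hyperparameter optimization of an off-the-shelf ensemble learner, scored with AUC and MCC as named above. It is not the paper's novelEnsemble method; the data set, parameter grid, and classifier are placeholders.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import roc_auc_score, matthews_corrcoef

    # synthetic, imbalanced stand-in for a defect data set
    X, y = make_classification(n_samples=500, n_features=20, weights=[0.8], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    # HO: exhaustive search over a small parameter grid, selected by cross-validated AUC
    grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
    search = GridSearchCV(RandomForestClassifier(random_state=1), grid, scoring="roc_auc", cv=5)
    search.fit(X_tr, y_tr)

    best = search.best_estimator_
    print("tuned parameters:", search.best_params_)
    print("AUC:", roc_auc_score(y_te, best.predict_proba(X_te)[:, 1]))
    print("MCC:", matthews_corrcoef(y_te, best.predict(X_te)))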

    How repeated data points affect bug prediction performance: A case study

    In defect prediction studies, open-source and real-world defect data sets are frequently used. The quality of these data sets is one of the main factors affecting the validity of defect prediction methods. One such issue is repeated data points in defect prediction data sets. The main goal of the paper is to explore how low-level metrics are derived. This paper also presents a cleansing algorithm that removes repeated data points from defect data sets. The method was applied to 20 data sets, including five open-source sets, improving the area under the curve (AUC) and precision performance measures by 4.05% and 6.7%, respectively. In addition, this work discusses how static code metrics should be used in bug prediction. The study provides tips for obtaining better defect prediction results.
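
    The cleansing step described above amounts to dropping rows whose metric values (and label) repeat exactly. A minimal sketch of that idea follows; the file name, label column, and classifier are placeholders rather than the paper's actual pipeline.

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def mean_auc(df, label_col="defective"):
        # cross-validated AUC of a simple classifier on the given metrics table
        X, y = df.drop(columns=[label_col]), df[label_col]
        return cross_val_score(GaussianNB(), X, y, cv=5, scoring="roc_auc").mean()

    raw = pd.read_csv("defect_dataset.csv")   # placeholder path: metrics table with a "defective" column
    clean = raw.drop_duplicates()             # remove repeated data points
    print("AUC before cleansing:", mean_auc(raw))
    print("AUC after cleansing :", mean_auc(clean))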

    A novel defect prediction method for web pages using k-means++

    With the increasing complexity of web software, defect detection and prevention have become crucial processes in the software industry. Over the past decades, defect prediction research has reported encouraging results for reducing software product costs. Despite these promising results, such research has rarely been applied to web-based systems using clustering algorithms. An appropriate implementation of clustering in defect prediction may help estimate defects in web page source code. One of the most widely used clustering algorithms is k-means, whose derived versions, such as k-means++, show good performance on large data sets. Here, we present a new defect clustering method using k-means++ for web page source code. According to the experimental results, almost half of the defects are detected in the middle of web pages. k-means++ is significantly better than the other four clustering algorithms on three criteria across four data sets. We also tested our method with four classifiers; the results show that, after clustering, Linear Discriminant Analysis is, in general, better than the other three classifiers.
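
    To make the clustering step concrete, the sketch below groups web-page metric rows with k-means++ initialization, reports the defect rate per cluster, and then fits a Linear Discriminant Analysis classifier as the abstract mentions. The features, labels, and the way cluster labels are fed to the classifier are assumptions, not the paper's data or design.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.random((300, 5))                      # web-page source metrics (placeholder)
    y = (rng.random(300) < 0.3).astype(int)       # defect labels (placeholder)

    km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
    for c in range(4):
        mask = km.labels_ == c
        print(f"cluster {c}: {mask.sum()} pages, defect rate {y[mask].mean():.2f}")

    # per the abstract, a classifier such as LDA can be applied after clustering;
    # here the cluster labels are simply appended as an extra feature (an assumption)
    lda = LinearDiscriminantAnalysis().fit(np.c_[X, km.labels_], y)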